arXiv:1501.07338v1 [cs.CV] 29 Jan 2015


On Vectorization of Deep Convolutional Neural Networks for Vision Tasks 


Jimmy SJ. Ren Li Xu 

Lenovo Research & Technology 

http://vcnn.deeplearning.cc 
jimmy.sj.ren@gmail.com xulihk@lenovo.com


Abstract 

We have recently witnessed many ground-breaking results in machine
learning and computer vision, generated by using deep convolutional
neural networks (CNN). While the success mainly stems from the large
volume of training data and the deep network architectures, vector
processing hardware (e.g. GPU) undisputedly plays a vital role in modern
CNN implementations to support massive computation. Though much
attention has been paid in the extant literature to understanding the
algorithmic side of deep CNNs, little research has been dedicated to
vectorization for scaling up CNNs. In this paper, we study the
vectorization process of key building blocks in deep CNNs, in order to
better understand and facilitate parallel implementation. Key steps in
training and testing deep CNNs are abstracted as matrix and vector
operators, upon which parallelism can be easily achieved. We developed
and compared six implementations with various degrees of vectorization,
with which we illustrate the impact of vectorization on the speed of
model training and testing. In addition, a unified CNN framework for
both high-level and low-level vision tasks is provided, along with a
vectorized Matlab implementation with state-of-the-art speed
performance.


Introduction 

Deep convolutional neural networks (CNNs) have become a key tool in
addressing large-scale artificial intelligence tasks. Though the study
of CNNs can be traced back to the late 1980s (LeCun et al. 1989;
LeCun et al. 1990), the recent success of deep CNNs is largely
attributed to concurrent progress in two technical streams. On the one
hand, new deep CNN architectures with elements such as Dropout
(Hinton et al. 2012; Krizhevsky, Sutskever, and Hinton 2012),
DropConnect (Wan et al. 2013) and Rectified Linear Units (ReLU)
(Nair and Hinton 2010), as well as new optimization strategies
(Dean et al. 2012), have empowered deep CNNs with greater learning
capacity. On the other hand, the rapid advances and democratization of
high-performance general-purpose vector processing hardware, typified by
the graphics processing unit (GPU), unleash the potential power of deep
CNNs by scaling up the network significantly.

Various infrastructures have been used in scaling up deep CNNs,
including GPU (Coates et al. 2013), distributed CPU-based frameworks
(Dean et al. 2012), FPGA (Farabet et al. 2009), etc. Though the
implementation details among those approaches differ, the core insight
underlying the idea of scaling up deep CNNs is parallelization
(Bengio and LeCun 2007), in which the vectorization technique is the
fundamental element. The consistently distinguished performance of
GPU-trained CNNs in the ImageNet visual recognition challenge
(Krizhevsky, Sutskever, and Hinton 2012; Russakovsky et al. 2013), as
well as the results reported in many studies in the literature
(Jia et al. 2014; Sermanet et al. 2013), justifies its effectiveness.
However, the published literature does not provide sufficient insight
into how the vectorization was carried out in detail. We also found no
previous study that answers how different degrees of vectorization
influence the performance of deep CNNs, which is crucial for finding the
bottlenecks and helps to scale up the network architecture. We believe
these questions form a significant research gap and that the answers to
them shall shed some light on the design, tuning and implementation of
vectorized CNNs.

In this paper, we reinterpret the key operators in deep CNNs in
vectorized forms with which high parallelism can be easily achieved
given basic parallelized matrix-vector operators. To show the impact of
vectorization on the speed of both model training and testing, we
developed and compared six implementations of CNNs with various degrees
of vectorization. We also provide a unified framework for both
high-level and low-level vision applications including recognition,
detection, denoising and image deconvolution. Our Matlab Vectorized CNN
implementation (VCNN) will be made publicly available on the project
webpage.


Related Work 

Efforts to speed up CNNs by vectorization started with their inception.
A specialized CNN chip (Jackel et al. 1990) was built and successfully
applied to handwriting recognition in the early 90s. Simard et al.
(2003) simplified CNN by fusing convolution and pooling operations. This
sped up the network and performed well in document analysis. Chellapilla
et al. (2006) adopted the same architecture but unrolled the
convolution operation into a matrix-matrix product. It has now been
proven that this vectorization approach works particularly well with
modern GPUs. However, limited by the available computing power, the
scale of the CNNs explored at that time was much smaller than that of
modern deep CNNs.

When deep architectures showed their ability to effectively learn highly
complex functions (Hinton, Osindero, and Teh 2006), scaling up neural
network based models soon became one of the major tasks in deep
learning (Bengio and LeCun 2007). Vectorization played an important
role in achieving this goal. Scaling up CNNs by vectorized GPU
implementations such as Caffe (Jia et al. 2014), Overfeat
(Sermanet et al. 2013), CudaConvnet
(Krizhevsky, Sutskever, and Hinton 2012) and Theano
(Bergstra et al. 2010) generates state-of-the-art results on many
vision tasks. Despite the good performance, few of the previous papers
elaborated on their vectorization strategies. As a consequence, how
vectorization affects design choices in both model training and testing
is unclear.

Efforts have also been made to accelerate parts of the deep CNN from
the algorithmic side, exemplified by separable kernels for convolution
(Denton et al. 2014) and the FFT speedup
(Mathieu, Henaff, and LeCun 2013). Instead of finding a faster
alternative for one specific layer, we focus on the general
vectorization techniques used in all building blocks of deep CNNs, which
is instrumental not only in accelerating existing networks, but also in
providing guidance for implementing and designing new CNNs across
different platforms, for various vision tasks.

Vectorization of Deep CNN 

Vectorization refers to the process that transforms the original data
structure into a vector representation so that the scalar operators can
be converted into a vector implementation. In this section, we introduce
vectorization strategies for different layers in deep CNNs.

Figure 1 shows the architecture of a typical deep CNN for vision tasks.
It contains all of the essential parts of modern CNNs. Comprehensive
introductions to the general architecture of CNNs and the recent
advances can be found in (LeCun et al. 1998) and
(Krizhevsky, Sutskever, and Hinton 2012).

We mark the places where vectorization plays an important role. “a” is
the convolution layer that transforms the input image into feature
representations, whereas “b” is the one that handles the pooling-related
operations. “c” represents the convolution-related operations for
feature maps. We will see shortly that the vectorization strategies for
“a” and “c” are slightly different. “d” involves operations in the fully
connected network. Finally, “e” is the vectorization operation required
to simultaneously process multiple input samples (e.g. mini-batch
training). It is worth noting that we need to consider both the forward
pass and back-propagation for all these operations.

Vectorizing Convolution 

Figure 1: Convolutional neural network architecture for visual
recognition.

We refer to the image and intermediate feature maps as $f$ and to one of
the convolution kernels as $w_i$. The convolution layer can typically be
expressed as

$$f_i^{l+1} = \sigma(w_i^l * f^l + b_i^l), \tag{1}$$

where $i$ indexes the kernel, $l$ indexes the layer, $b^l$ is the bias
weight and $*$ is the convolution operator. For vision tasks, $f$ can be
2- or 3-dimensional. The outputs from the previous layer can be deemed
as one single input $f^l$. $\sigma$ is the nonlinear function, which
could be ReLU, hyperbolic tangent, sigmoid, etc. Adding the bias weight
and applying the nonlinear mapping are element-wise operations which can
be deemed as already fully vectorized, i.e. the whole feature vector can
be processed simultaneously. In contrast, the convolution operators
involve many multiplications with conflicting memory accesses. Even if
the operators are parallelized for each pixel, the parallelism
(Ragan-Kelley et al. 2013) to be exploited is rather limited: compared
to the number of computing units on a GPU, the number of convolutions in
one layer is usually smaller. A fine-grained parallelism on element-wise
multiplication is much preferred, which leads to the vectorization
process of unrolling the convolution.

In what follows, all the original data $f$, $b$ and $w$ can be viewed as
data vectors. Specifically, we seek vectorization operators
$\varphi_c()$ to map a kernel or feature map to its matrix form so that
convolution can be conducted by matrix-vector multiplication. However, a
straightforward kernel-matrix, image-vector product representation of
convolution is not applicable here, since the kernel matrix is a sparse
block-Toeplitz-Toeplitz-block one, not suitable for parallelization due
to the existence of many zero elements. Thanks to the duality of kernel
and feature map in convolution, we can construct a dense feature-map
matrix and a kernel vector. Further, multiple kernels can be put
together to form a matrix so as to generate multiple feature map outputs
simultaneously:

$$[f_i^{l+1}]_i = \sigma(\varphi_c(f^l)\,[w_i^l]_i + [b_i^l]_i). \tag{2}$$

The operator $[\,]_i$ assembles the vectors with index $i$ to form a
matrix.

Backpropagation. The training procedure requires the backward
propagation of gradients through $\varphi_c(f^l)$. Note that
$\varphi_c(f^l)$ is in the unrolled matrix form, different from the
outputs of the previous layer $[f_i^l]_i$. An inverse operator
$\varphi_c^{-1}()$ is thus required to transform the matrix-form
gradients into the vector form for further propagation. Since
$\varphi_c()$ is a one-to-many mapping, $\varphi_c^{-1}()$ is a
many-to-one operator. Fortunately, the gradient update is a linear
process, so its parts can be processed separately and combined
afterwards.

Figure 2: Strategy to vectorize convolution. Illustration of the way to
convolve a 3x3 input image with three 2x2 kernels and generate three
2x2 feature maps.

Figure 3: Strategy to vectorize convolution with a feature map.
Illustration of the way to convolve a 3x3x3 feature map.

Figure 4: Strategy to vectorize pooling. Illustration of pooling for one
4x4 feature map.

Matlab Practice. Our Matlab implementation to vectorize the input image
is shown in Fig. 2. Specifically, we first crop the image patches based
on the kernel size and reorganize them into columns, as indicated by the
dotted bounds. Convolution kernels are arranged by rows in another
matrix. We can see that the product of these two matrices puts all the
convolved feature maps in the resulting matrix, one feature map per row.
$\varphi_c()$ here can be efficiently implemented by the im2col()
functions in Matlab on both GPU¹ and CPU. We note that the feature map
vectors here are the transpose of $f_i$, simply because of the way
im2col() arranges its output.
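
As a hedged illustration of the unrolling in Fig. 2, the following
Python sketch reproduces the layout described above with plain lists. It
is not the VCNN code: the helper names (im2col, matmul, conv_layer) are
hypothetical, patches are scanned row by row rather than in Matlab's
column-major order, and kernels are applied without flipping
(cross-correlation). All of these are assumptions made for brevity.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch (valid region only) into one column."""
    rows, cols = len(image), len(image[0])
    patches = [
        [image[r + i][c + j] for i in range(kh) for j in range(kw)]
        for r in range(rows - kh + 1)
        for c in range(cols - kw + 1)
    ]
    # Transpose so that each patch becomes a column, as in Fig. 2.
    return [list(col) for col in zip(*patches)]


def matmul(a, b):
    """Dense matrix product of two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def conv_layer(image, kernels):
    """Apply all kernels at once: (kernel rows) x (patch columns)."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    kernel_rows = [[v for row in k for v in row] for k in kernels]
    return matmul(kernel_rows, im2col(image, kh, kw))


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [
    [[1, 0], [0, 0]],  # picks the top-left pixel of each patch
    [[0, 0], [0, 1]],  # picks the bottom-right pixel of each patch
    [[1, 1], [1, 1]],  # sums each patch
]
feature_maps = conv_layer(image, kernels)
```

Each row of feature_maps holds one flattened 2x2 output, which matches
the one-feature-map-per-row layout described in the text.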

Convolution with the feature map. An alternative practice is needed to
handle convolution of the feature map (e.g. “c” in Fig. 1). This is
because we need to first combine $[f_i^l]_i$ into a higher-dimensional
$f^l$ and then perform convolution. One example is illustrated in
Fig. 3, where we need to combine three 3x3 feature maps into a 3x3x3 one
and apply $\varphi_c()$. In practice, we could first apply $\varphi_c()$
three times to $f_i^l$ and then combine the results. We found this less
efficient since the actual number of feature maps is much larger. To
exploit more parallelism, we try to reorganize the data and apply the
vectorization operator $\varphi_c()$ only once.

In Fig. 3, we first reshape the column vectors $[f_i]_i$ (a) back to 2D
matrices and put them side by side (b). This operation is cheap since it
just changes the representation of the data but does not rearrange it.
It allows us to vectorize all the feature maps in a holistic way (c).
Since only the valid region is considered during convolution, we set all
the redundant columns to zeros. The final $\varphi_c(f)$ is obtained by
rearranging the intermediate result in (c). Since redundant columns are
involved, we use the Matlab function accumarray() to handle the
many-to-one rearrangement.

¹ We used a custom version of im2col() for GPU.

One may note that the operator $\varphi_c^{-1}()$ is a many-to-one
mapping as well. So it can be efficiently implemented by accumarray()
for backpropagation.
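
The many-to-one accumulation can be sketched in Python as follows. This
is an assumption-laden illustration rather than the authors' code: the
function named accumarray here is a hypothetical pure-Python stand-in
for the summing behaviour the text attributes to Matlab's accumarray(),
and patch gradients are stored one patch per row, scanned row by row.

```python
def accumarray(indices, values, size):
    """Sum all values that share the same target index."""
    out = [0] * size
    for idx, val in zip(indices, values):
        out[idx] += val
    return out


def patch_index_map(rows, cols, kh, kw):
    """Flat pixel index that each unrolled patch entry was copied from."""
    return [
        (r + i) * cols + (c + j)
        for r in range(rows - kh + 1)
        for c in range(cols - kw + 1)
        for i in range(kh)
        for j in range(kw)
    ]


def col2im_grad(patch_grads, rows, cols, kh, kw):
    """Fold per-patch gradients back into an image-shaped gradient."""
    flat = [g for patch in patch_grads for g in patch]
    index_map = patch_index_map(rows, cols, kh, kw)
    summed = accumarray(index_map, flat, rows * cols)
    return [summed[r * cols:(r + 1) * cols] for r in range(rows)]


# A gradient of 1 on every entry of the four 2x2 patches of a 3x3 image.
patch_grads = [[1, 1, 1, 1] for _ in range(4)]
image_grad = col2im_grad(patch_grads, 3, 3, 2, 2)
```

With unit gradients, each pixel of image_grad simply counts how many
patches cover it, which makes the one-to-many nature of the forward
unrolling visible: the centre pixel receives four contributions and each
corner only one.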

Vectorizing Pooling 

It is inefficient to carry out the pooling separately for each feature
map, so the goal here is to simultaneously process those separate
operations by vectorization. The pooling operator can be abstracted as

$$f^{l+1} = \sigma(\varphi_p(f^l) + b^l), \tag{3}$$

where $\varphi_p()$ is a many-to-one mapping with a defined operation
corresponding to max- or average-pooling.

Due to the information loss, the inverse operator $\varphi_p^{-1}()$ is
not well defined. We simply use nearest neighbor upscaling as an
approximation during backpropagation.

The pooling operations, both average pooling and max pooling, can be
thought of as a vector accumulation process guided by a pre-defined
index map, which can similarly be implemented by accumarray(). The only
difference for max pooling is that a max function is involved. For
overlapping pooling, we could insert the overlapped elements into the
feature map and then apply the same pooling strategy.
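
The index-map view of pooling can be sketched in Python as follows, for
the non-overlapping 2x2 pooling of one 4x4 feature map (the setting of
Fig. 4). The helper names are hypothetical and the grouping loop is a
plain-Python stand-in for an accumulation routine, so this should be
read as an illustration under those assumptions rather than as the VCNN
implementation.

```python
def pool_index_map(rows, cols, size):
    """Flat output-cell index for every pixel, scanned row by row."""
    out_cols = cols // size
    return [(r // size) * out_cols + (c // size)
            for r in range(rows) for c in range(cols)]


def accumulate(indices, values, n_out, reduce_fn):
    """Group values by target index and reduce each group."""
    groups = [[] for _ in range(n_out)]
    for idx, val in zip(indices, values):
        groups[idx].append(val)
    return [reduce_fn(g) for g in groups]


def nearest_neighbor_upscale(pooled, index_map):
    """Approximate inverse used in backpropagation: copy each pooled
    value back to every pixel of its pooling window."""
    return [pooled[idx] for idx in index_map]


feature_map = [[1, 2, 5, 6],
               [3, 4, 7, 8],
               [9, 10, 13, 14],
               [11, 12, 15, 16]]
flat = [v for row in feature_map for v in row]
index_map = pool_index_map(4, 4, 2)

max_pooled = accumulate(index_map, flat, 4, max)
avg_pooled = accumulate(index_map, flat, 4, lambda g: sum(g) / len(g))
upscaled = nearest_neighbor_upscale([10, 20, 30, 40], index_map)
```

Max and average pooling share the same index map and differ only in the
reduction function, which is the point made in the paragraph above.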

Vectorizing Fully Connected Layers 

The fully connected layers (i.e. “d” in Fig. 1) can be written as a
dense matrix-matrix multiplication. Thus both the feed-forward pass and
backpropagation are naturally vectorized, in a unified matrix-matrix
form.

Vectorization for Mini-batches 

Our vectorization strategy can be directly extended to support
mini-batch training. Given a batch of samples indexed by $j$, the
mini-batch training with a convolution layer is given by

$$[f_i^{l+1}]_i = \sigma([\varphi_c(f_j^l)]_j\,[w_i^l]_i + [b_i^l]_i), \tag{4}$$

where $[\,]_j$ assembles the matrices of the different samples.

Figure 5 shows the Matlab implementation of batch mode for the same
operation as in Fig. 2 with a batch size of 2. Both samples in the input
batch are vectorized and the outputs are arranged horizontally. We can
show that the product of the same kernel matrix as in Fig. 2 and this
matrix simultaneously generates feature maps for both samples. Note that
if an input sample of a convolutional layer has multiple channels, we
could treat it as a multi-channel feature map as shown in Fig. 3.

Figure 5: Strategy to vectorize mini-batch operations.
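
A minimal Python sketch of this batch layout, under the same assumptions
as before (hypothetical helper names, row-by-row patch scanning, no
kernel flipping, plain lists instead of a matrix library), places the
unrolled patch matrices of the two samples side by side and multiplies
once by the kernel matrix:

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch (valid region only) into one column."""
    rows, cols = len(image), len(image[0])
    patches = [[image[r + i][c + j] for i in range(kh) for j in range(kw)]
               for r in range(rows - kh + 1) for c in range(cols - kw + 1)]
    return [list(col) for col in zip(*patches)]


def matmul(a, b):
    """Dense matrix product of two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def batch_patch_matrix(batch, kh, kw):
    """Concatenate the patch matrices of all samples horizontally."""
    per_sample = [im2col(img, kh, kw) for img in batch]
    return [[v for m in per_sample for v in m[r]] for r in range(kh * kw)]


batch = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[9, 8, 7], [6, 5, 4], [3, 2, 1]],
]
kernel_rows = [[1, 1, 1, 1]]  # one 2x2 summing kernel, flattened
out = matmul(kernel_rows, batch_patch_matrix(batch, 2, 2))
```

The single output row contains the flattened feature map of the first
sample followed by that of the second, so one matrix product serves the
whole batch.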

Experiments and Analysis 

The goal of the experiments presented in this section is to understand
the role of vectorization in training and testing CNNs as well as its
limitations.

To make sure our experimental results are valid and relevant, we
compared the training and testing speed of our fully vectorized
implementation with Caffe and CudaConvnet. The speed is competitive, if
not faster, in all the tested cases.

Comparing Different Degrees of Vectorization 

In this section, we seek to understand the role of vectorization by
comparing six CNN implementations. These implementations differ in the
degree of vectorization, as listed in Table 1. Imp-1 is the least
vectorized one, Imp-2 naively parallelizes the processing of batch
samples by adding a parallel for-loop to Imp-1, whilst Imp-6 is a fully
vectorized implementation guided by the approaches we introduced. When
working with one particular implementation, we also observe how results
change with network scale. The reason is that we would like to examine
the different vectorization strategies with both small-scale and
large-scale networks, and we are particularly interested in large-scale
CNNs since they are more relevant to recent advances in the field.

We consider three scales. Scale 1 is a small model; it is very similar
to the standard LeNet (LeCun et al. 1998), but with more feature maps in
the convolutional layers², ReLU nonlinearity and cross-entropy error.
Scale 2 is a large network with about 60 million trainable parameters,
which is comparable to AlexNet
(Krizhevsky, Sutskever, and Hinton 2012). However, the architecture is
tailored for the purpose of this study. First, we would like to keep the
number of conv layers and the number of fully connected layers balanced
so that we have a fair performance breakdown. It also allows us to
directly compare the results with Scale 1. Second,


Vec ele   Fu-co   Conv   Pool   Feat   Batch
Imp-6     ✓       ✓      ✓      ✓      ✓
Imp-5     ✓       ✓      ✓      ✓
Imp-4     ✓       ✓      ✓
Imp-3     ✓       ✓
Imp-2     ✓
Imp-1     ✓

Table 1: Different vectorized elements included in the six CNNs. Fu-co:
fully connected layer. Conv: convolutional layer. Pool: pooling layer.
Feat: feature map patchification. Batch: vectorize for batch. A tick
indicates the element is vectorized.


to enable a fair comparison, we would like to have a unified
vectorization scheme for convolution throughout the network. Thus,
unlike AlexNet, which uses a stride-4 convolution in the first conv
layer and stride 1 thereafter, all convolution operations use the same
stride of 1 in our model. As a consequence, compared to AlexNet, scale 2
tends to have more feature maps in the conv layers but a smaller size
for the input images. We set the number of output units to 1000 as in
AlexNet. Scale 3 is a larger network with 10,000 output units. This
pushes the number of trainable parameters to 94 million, keeping other
settings the same as scale 2. The performance on GPU³ (in terms of the
number of images processed per second) of the six CNNs with different
network scales during training is shown in Table 2.

We are able to draw several insights from the figures. First of all, the
results indicate that vectorization is vital to the speed of CNNs. We
can see that Imp-1 is very slow and that a naive parallelization (Imp-2)
seems to work poorly on GPU. Especially for the large-scale networks,
Imp-1 and Imp-2 are simply too slow to be practical. When training a
small network, a fully vectorized CNN (Imp-6) is more than 200 times
faster than the naive parallelization version during training and more
than 100 times faster during testing. This acceleration is going to be
more significant for bigger networks since Imp-1 and Imp-2 scale poorly.

Second, all the vectorization elements we introduced contribute
significantly to the final performance, during both training and
testing. One interesting insight is that the contribution of vectorizing
pooling and feature map patchification seems to increase with the scale
of the network. For instance, in Table 2, Imp-4 (vectorized pooling) has
a 1.9x speedup over Imp-3 under scale 1, but a 4.5x speedup and a 4.3x
speedup under scale 2 and scale 3 respectively. The same phenomenon
occurs for testing. This strongly indicates that the vectorization
strategy for those two elements scales well with the size of the
network.

On the other hand, we also observe that vectorizing batch processing
brings more than a 10x speedup for small models but only a 3x to 5x
speedup for large-scale models. The contribution of vectorizing batch
processing to the performance seems to decrease when scaling up the
network, though the speedup remains significant. We further investigate
this phenomenon in the next section, which leads to a strategy to
achieve optimal training and testing speed.

² This setting of the conv layers is the same as in Caffe's MNIST
example. We have 2 fully connected hidden layers.

³ GeForce GTX 780 Ti; the same card was used in the rest of the
experiments.

#img/sec   Imp-1   Imp-2   Imp-3   Imp-4   Imp-5   Imp-6
Scale 1    1       6.1     15.3    29.5    85.4    1312.1
Scale 2    n/a     n/a     2.4     11      42.3    188.7
Scale 3    n/a     n/a     2.3     10      34.3    161.2

* Batch size is 100 for scale 1 and 200 for scale 2 and 3.

Table 2: Training performance of the six CNNs (#images to be processed
per second). Scale 1: small model, 10 output units; Scale 2: large
model, 1000 output units; Scale 3: larger model, 10000 output units.

In Search of Optimal Speed 

We investigated the puzzle of the diminishing speedup by scrutinizing
the performance against different batch sizes. The results are presented
in Tables 3 and 4 for training and testing respectively.


#img/sec   b=1     b=100    b=200    b=300    b=400
Scale 1    88.5    1312     1450.9   1574.2   1632.8
Scale 2    41.9    136.9    188.7    192.3    106.3
Scale 3    34.3    123.5    161.3    163.9    91

Table 3: Training performance of Imp-6 against different batch sizes
(#images to be processed per second).


#img/sec   b=1     b=100    b=200    b=400    b=600
Scale 1    151.5   1812.6   1878.4   2023.5   2192.2
Scale 2    75.8    222.2    270.2    285.7    103.1
Scale 3    74      212.8    256.4    277.8    89.2

Table 4: Test performance of Imp-6 against different batch sizes
(#images to be processed per second).

In Table 3, we can see that for the small model (scale 1) the
acceleration brought by each adjacent batch size increase is 14x, 1.1x,
1.08x and 1.03x. The acceleration obtained by increasing the batch size
seems to vanish rapidly. For the large model (scale 2), the first three
acceleration ratios are 3.2x, 1.3x and 1.02x, demonstrating the same
vanishing trend. A further increase in batch size even leads to a
performance degradation. The same situation occurs for the larger model
(scale 3). Though the ability to process 192 images/second for training
and 285 images/second for testing with our commodity GPU for the scale 2
network is promising, this result still indicates that there is some
scaling limitation within the vectorization for batch processing.
Similar results in Table 4 seem to further suggest that this limitation
is shared between training and testing. In order to completely
understand the rationale under the hood, we have to resort to a detailed
performance breakdown.
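
For readers who want to retrace the ratios quoted above, the following
Python snippet recomputes the speedup between adjacent batch sizes from
the values of Table 3. The function name adjacent_speedups is
hypothetical and the numbers are copied from the table. The exact ratios
come out slightly higher than the quoted figures (for example, about
14.8 for the first scale 1 step), which suggests the text truncates
rather than rounds.

```python
def adjacent_speedups(throughputs):
    """Ratio of each throughput to the one measured just before it."""
    return [b / a for a, b in zip(throughputs, throughputs[1:])]


# Training throughput of Imp-6 (images/second) from Table 3,
# for batch sizes 1, 100, 200, 300 and 400.
table3 = {
    "Scale 1": [88.5, 1312, 1450.9, 1574.2, 1632.8],
    "Scale 2": [41.9, 136.9, 188.7, 192.3, 106.3],
    "Scale 3": [34.3, 123.5, 161.3, 163.9, 91],
}

speedups = {name: adjacent_speedups(vals) for name, vals in table3.items()}
for name, ratios in speedups.items():
    print(name, ", ".join(f"{r:.2f}x" for r in ratios))
```

A ratio below 1 in the last position for scale 2 and scale 3 corresponds
to the performance degradation at batch size 400 noted above.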

Performance Breakdown and Limitation. We decompose the whole training
procedure into the following components: 1) conv layers; 2) pooling
layers; 3) fully connected layers; 4) others (e.g. ReLU, cost). We
distinguish the statistics of the forward pass from those of
back-propagation, which gives 8 components to look at.

Figure 6: Performance breakdown. (a) Scale 3 network, batch size = 1.
(b) Scale 3 network, batch size = 200. (c) Scale 3 network, batch size =
300. conv: conv layers, pool: pooling layers, full: fully connected
layers, other: other operations, _f: forward pass, _b:
back-propagation.

Figure 6 illustrates the performance breakdown (in terms of the
proportion of computing time in processing one batch) during training
for representative cases from our largest network (scale 3) in the
experiment. The batch size is 1 for Fig. 6(a), 200 for Fig. 6(b) and 300
for Fig. 6(c). We can observe from Fig. 6(a) that 44% of the overall
time was used in processing the fully connected layers. They were in
fact the biggest consumer of computing time for this batch size. We also
see that the time spent on full_b is significantly more than on full_f.
This makes sense because back-propagation involves a larger matrix
multiplication and a larger transform matrix than the forward pass. The
second largest consumer of time is the convolution layers, and we can
see that the time spent in the forward pass and in back-propagation is
reasonably balanced.

However, the situation we found in Fig. 6(b) and Fig. 6(c) is very
different. One obvious characteristic is that, when increasing the batch
size, the time consumed by the conv layers is now considerably more than
that of the fully connected layers. While the proportion between full_f
and full_b roughly remains the same among the three batch sizes, we
found that conv_f takes much more time than conv_b for large batch
sizes. This indicates that the scaling limitation lies within conv_f
when vectorizing for batch processing. A further scrutiny of this issue
shows that the limitation is caused by two factors, namely the memory
overhead in handling multiple samples and the overhead caused by
invoking patchification on bigger samples. While there might be
alternative strategies to vectorize batch processing, we argue that the
aforementioned overhead is hard to avoid completely.

Finding the Optimal Speed. We found that the observations from Fig. 6
are also valid for the scale 1 and scale 2 networks, but with an
important difference. For small networks like the scale 1 network, the
acceleration brought by batch processing holds up to very big batch
sizes (e.g. 1000), whilst for large networks the batch size needs to be
chosen carefully, or else a speed degradation like the one we saw in
Tables 3 and 4 occurs before the network hits the GPU memory ceiling.
This suggests that, given a network design, choosing an appropriate
batch size may be vital in achieving the optimal speed. Based on our
scale 2 network, we select 10 other networks by randomly adjusting
several parameters such as filter size, number of feature maps, number
of output units and sigmoid function, etc. We run these networks for
both training and testing while adjusting the batch sizes to see if this
contention is generally applicable to large networks.

Figure 7: Speed of 10 randomly selected networks. X axis: batch size.
Y axis: number of images to be processed per second. (a) Training.
(b) Testing.

Figure 7 confirms our aforementioned contention for large networks and
makes the importance of choosing an appropriate batch size obvious.
First, it suggests that the optimal batch size is usually quite
different for different network parameters. Directly adopting a batch
size from a previous set of network parameters may lead to significantly
inferior speed. Second, it also suggests that the optimal batch size
differs between the training stage and the testing stage, even for the
same network. A naive adoption of the batch size from the training stage
is often not optimal and leads to considerable speed loss. These
findings have direct implications for building real-time systems, in
which optimization for model testing is the key.
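
The same point can be checked against the paper's own numbers for the
scale 2 network. The short Python sketch below picks the batch size with
the highest measured throughput from Tables 3 and 4; best_batch_size is
a hypothetical helper, and since the two tables sample different batch
sizes the comparison is only a coarse one.

```python
def best_batch_size(throughput_by_batch):
    """Return the batch size with the highest measured throughput."""
    return max(throughput_by_batch, key=throughput_by_batch.get)


# Scale 2 throughput of Imp-6 (images/second), copied from Tables 3 and 4.
train_scale2 = {1: 41.9, 100: 136.9, 200: 188.7, 300: 192.3, 400: 106.3}
test_scale2 = {1: 75.8, 100: 222.2, 200: 270.2, 400: 285.7, 600: 103.1}

best_train = best_batch_size(train_scale2)
best_test = best_batch_size(test_scale2)
```

On these measurements the best training batch size is 300, whereas
testing is fastest at 400, a batch size at which training throughput has
already dropped sharply.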

Unification of High/Low Level Vision Tasks 

Despite the rapid adoption of deep CNNs in addressing various kinds of
high-level computer vision tasks, typified by image classification and
object localization, other problems such as detecting objects of
different shapes in real time still seem to be under investigation. On
the other hand, we observed that a few very recent studies
(Xu et al. 2014; Eigen, Krishnan, and Fergus 2013) successfully used
deep CNNs in various low-level vision tasks such as image deblurring and
denoising. Though the domain knowledge required to build those new
networks differs substantially from that used in addressing high-level
vision tasks, the same vectorization principles presented in this paper
apply.

More interestingly, the same vectorization principle across those tasks
gives us a chance (perhaps for the first time) to unify both high-level
and low-level vision tasks in a single computational framework. In this
section, we introduce the application of our VCNN implementation to
tasks from seemingly distinct fields, namely image denoising and
deblurring (low-level vision) as well as multi-object detection
(high-level vision).


CNN for Image Processing 

Image processing tasks do not require pooling and fully connected layers
in general. To verify the effectiveness of the proposed vectorized
framework, we implemented a network architecture by simply removing the
pooling and fully connected layers from Fig. 1 and trained the network
with synthesized clear-noisy image pairs. One of the denoising results
is given in Fig. 8. Another sample application of our vectorized CNN is
the recently proposed image deconvolution (Xu et al. 2014). The result
is shown in Fig. 9.



Figure 8: Application in image denoising. 



Figure 9: Application in image deconvolution. 


Novel Training Scheme for Multi-object Detection 

Conventional image classifiers are usually trained with image samples of
equal size. This imposes a critical limitation when applying them in
detection. For instance, it is reasonable to put a human face sample in
a square image, but doing so for non-square objects (e.g. shoes) tends
to include more background content and thus introduces more noise, which
is detrimental to accurate and efficient object detection. One possible
alternative is to formulate object detection as a regression problem
(Szegedy, Toshev, and Erhan 2013). However, it requires a very large
amount of data and usually very big models to capture the variety of the
possible patterns.



Figure 10: Application in real time multi-object detection. 
Shoes review videos are from Youtube. 

































Using VCNN, we were able to train a single image classifier with
heterogeneous input sizes by using vectorization. The key insight is
that heterogeneous inputs can share all the weights in a CNN except the
ones in the connection between the conv layers and the fully connected
layers. This approach not only avoids the background noise but is also
much more lightweight than the regression approach. We successfully
applied it in a detection system which runs in real time. We can show
that this approach tends to produce fewer false alarms and works
efficiently with multi-scale detection through vectorization.

Conclusion 

In this paper, we elaborate on several aspects of the vectorization of
deep CNNs. First, we present the vectorization steps of all essential
parts of implementing deep CNNs. The vectorization steps are further
exemplified by Matlab practices. Second, we have developed and compared
six CNN implementations with different degrees of vectorization to
analyze the impact of vectorization on speed. Third, based on these
practices, we provide a unified framework for handling both low-level
and high-level vision tasks. Experiments on various applications
including image denoising, deconvolution and real-time object detection
demonstrate the effectiveness of the proposed strategies. As the
introduced vectorization techniques are general enough, our future
directions include optimization for different hardware or cloud
platforms.